Supervised Morphology Generation Using Parallel Corpus
Authors
Abstract
Translating from English, a morphologically poor language, into morphologically rich languages such as Persian poses many challenges. In this paper, we present an approach to rich morphology prediction using a parallel corpus. We focus on verb conjugation as the most important and problematic morphological phenomenon in Persian. We define a set of features drawn from both English and Persian linguistic information, and use an English-Persian parallel corpus to train our model. We then predict six morphological features of the verb and generate the inflected verb form from its lemma. As a baseline in our experiments, we generate the verb form with the most common feature values. The results of our experiments show an improvement of almost 2.1% absolute BLEU score on a test set containing 16K sentences.
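The most-common-value baseline mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the six morphological features or describe the data format, so the feature names (`tense`, `person`, `number`, `negation`), the toy annotations, and the function name below are all hypothetical; this is not the authors' implementation.

```python
from collections import Counter


def most_common_feature_values(training_verbs):
    """Return, for each feature, the value seen most often in the training verbs."""
    counts = {}
    for features in training_verbs:
        for name, value in features.items():
            # One Counter per feature name, created on first use.
            counts.setdefault(name, Counter())[value] += 1
    return {name: counter.most_common(1)[0][0] for name, counter in counts.items()}


# Hypothetical toy annotations; the real feature inventory is not given in the abstract.
training_verbs = [
    {"tense": "past", "person": "3", "number": "sg", "negation": "pos"},
    {"tense": "present", "person": "3", "number": "sg", "negation": "pos"},
    {"tense": "past", "person": "1", "number": "pl", "negation": "pos"},
]

# The baseline assigns these same values to every verb, regardless of context.
baseline = most_common_feature_values(training_verbs)
print(baseline)
```

A baseline of this kind ignores the source sentence entirely, which is why a model that predicts each feature from English and Persian context has room to improve on it.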
Similar resources
Modelling Linguistic Phenomena with Unsupervised Morphology for Improving Statistical Machine Translation
This work studies an ascetic approach to statistical machine translation. We assume that only a small parallel corpus is available, and no other mono- or bilingual corpora or linguistic tools can be used, which is the case for many resource-scarce languages. Our aim is to find out how a baseline SMT system can be improved under this condition. In such a case one of the natural choices is to use u...
Statistical models for unsupervised, semi-supervised and supervised transliteration mining
We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings, unsupervised, semi-supervised and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e. noise). The model is trained on noisy unlabelled da...
Cross-lingual Discourse Relation Analysis: A corpus study and a semi-supervised classification system
We present a cross-lingual discourse relation analysis based on a parallel corpus with discourse information available only for one language. First, we conduct a corpus study to explore differences in discourse organization between Chinese and English, including differences in information packaging, implicit/explicit discourse expression divergence, and discourse connective ambiguities. Second,...
Korean Word-Sense Disambiguation Using Parallel Corpus as Additional Resource
Most previous research on Korean Word-Sense Disambiguation (WSD) has focused on unsupervised corpus-based or knowledge-based approaches because of the lack of sense-tagged Korean corpora. Recently, along with great efforts by government and researchers to construct sense-tagged Korean corpora, finding appropriate features for supervised learning approach and improving its prediction ...
The Web as a Parallel Corpus
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structur...